SEPARATION OF GLOTTAL WAVE AND VOCAL TRACT 
TRANSFER FUNCTIONS BY SUCCESSIVE ITERATION 


By 

Maj. TARANJIT SINGH 





DEPARTMENT OF ELECTRICAL ENGINEERING
INDIAN INSTITUTE OF TECHNOLOGY KANPUR 

FEBRUARY 1991 



SEPARATION OF GLOTTAL WAVE AND VOCAL TRACT
TRANSFER FUNCTIONS BY SUCCESSIVE ITERATION 


A Thesis Submitted 

in Partial Fulfilment of the Requirements 
for the Degree of 

MASTER OF TECHNOLOGY 


By 

Maj. TARANJIT SINGH 


to the 

DEPARTMENT OF ELECTRICAL ENGINEERING

INDIAN INSTITUTE OF TECHNOLOGY KANPUR 

FEBRUARY 1991 












CERTIFICATE

This is to certify that the thesis entitled
'Separation of Glottal Wave and Vocal Tract Transfer Functions
by Successive Iteration' is a record of work carried out under
my supervision by Maj. Taranjit Singh and, to the best of my
knowledge, it has not been submitted elsewhere for a degree.




February, 1991 (Dr. G. C. Ray)

Assistant Professor 
Department Of Electrical Engineering 
Indian Institute Of Technology 



KANPUR 208016, INDIA 



Dedicated to

my mother Sardarni Gurdev Kaur and
father Sr Sadhu Singh, in their everlasting and loving memory,

and

my wife Ravi
and
children Areet Kaur and J. Karan Singh



ACKNOWLEDGEMENT

I wish to express my sincere thanks to

- Dr. G. C. Ray, my thesis supervisor, for his excellent guidance, wholesome stimulus and encouragement and, above all, his vigorous and considerate approach.

- Miss Seshu K, project associate, who provided all the necessary help in the early part of my hardware fabrication.

- my wife Ravi, for her sacrifices and for facing the hardships caused by my long absence from helping with household chores.

Maj. Taranjit Singh



SYNOPSIS

A speech signal Sr(z) of a sustained vowel can be represented in the frequency domain as the product P(z).G(z).V(z).R(z), where P(z) is the transfer function of a train of impulses, G(z) is the glottal wave transfer function, V(z) is the vocal tract transfer function and R(z) is the radiation load transfer function. Mathematically,

    Sr(z) = P(z).G(z).V(z).R(z).

The radiation load component of the speech signal is due to the conversion of the volume velocity of sound coming out of the lips into the pressure wave received at a microphone placed in the distant field. This conversion takes the form of a differential relationship between volume velocity and sound pressure. To remove the effect of this component, a process called inverse filtering using a digital integrator has been suggested by many authors, but in the present case a conventional analog integrator was used, whose performance is very similar to that of the digital integrator (of the form 1/(1 - a z^-1)) suggested by them. The speech signal can then be represented as

    S(z) = P(z).G(z).V(z), or equivalently S(z) = P(z).H(z),

where H(z) = G(z).V(z) and S(z) = Sr(z)/R(z).

The P(z) component of S(z) was separated out using a technique called homomorphic deconvolution. This removal of P(z) was carried out in the laboratory using an FFT analyser. In other words, H(z) was recovered from S(z) and was transferred to a PC through a GPIB interface.

The G(z) and V(z) components are superimposed in the frequency domain and are therefore difficult to separate. Their separation is what the thesis has endeavoured to achieve. L. R. Rabiner and Ronald W. Schafer showed that a lossless vocal tract system divided into N identical sections can be characterised by a set of its area functions or, equivalently, reflection coefficients. Mathematically, the vocal tract transfer function Va(z) is

    V_a(z) = \frac{0.5\,(1 + r_G) \prod_{k=1}^{N} (1 + r_k)\, z^{-N/2}}{1 - \sum_{k=1}^{N} \alpha_k z^{-k}}

where r_G is the glottal reflection coefficient, r_k are the reflection coefficients of the sections and \alpha_k are the denominator coefficients determined by them.

Using this Va(z), the glottal wave transfer function Ga(z) was obtained by simple division of H(z) by Va(z) in the frequency domain. The Ga(z) so obtained was represented by a synthetic glottal wave transfer function and was utilized to separate the individual Vi(z) from H(z). By successive iteration of these steps, and using the relation between linear predictor coefficients and PARCORs, the area functions of the individual vocal tract were determined from Vi(z). This helped in the reconstruction of the individual vocal tract. This reconstruction is of immense help in the diagnosis of pathological disorders.



CONTENTS

Chapter 1 : Introduction
Chapter 2 : Homomorphic Deconvolution
Chapter 3 : Successive Iteration
Chapter 4 : Linear Predictive Coding
Chapter 5 : Losses in Vocal Tract
Chapter 6 : Conclusion and Scope for Further Work
References



LIST OF SYMBOLS

Va(z)   Average area vocal tract transfer function
Vi(z)   Individual area vocal tract transfer function
Ga(z)   Average area glottal wave transfer function
Gi(z)   Individual area glottal wave transfer function
V(z)    Vocal tract transfer function
P(z)    Excitation pulse transfer function
R(z)    Radiation load transfer function
G(z)    Glottal wave transfer function
Sr(z)   Transfer function of the complete speech model
S(z)    Transfer function of the speech model without radiation load
H(z)    Transfer function of the speech model after removal of P(z) and R(z)




CHAPTER 1

INTRODUCTION

Speech is a peculiarly human activity, not endowed to other species. Speech is also the primary means of communication between people. In order to understand this activity it is essential to know the fundamentals of the speech production process. The development of good digital processing techniques has made the processing of speech signals in real time feasible. Specifically, we shall be concerned with digital signal processing techniques to study the steady state behaviour of the vocal tract system during the production of a single sustained vowel [4].

However, before the objective of the thesis is outlined, it is imperative that the process and mechanism of speech production are explained briefly.

1.1 THE PROCESS OF SPEECH PRODUCTION

Speech signals are composed of a sequence of sounds. These sounds and the transitions between them serve as a symbolic representation of information. Most languages, including English, can be described by distinctive units of sound called phonemes. In American English there are about 42 phonemes, including vowels, diphthongs, semivowels and consonants. Due to the limit on the rate of physical motion, human articulators produce speech at a rate of about 10 phonemes/sec. If we represent each phoneme by a six-bit binary number, the average information rate will thus be 60 bits/sec [1,4].

AIM

The speech production mechanism suggests that if individual vocal tract parameters are extracted from speech signals, they could provide important advantages. These are:

(i) They provide useful clues about the defective vocal tract of a pathological source so that it can be diagnosed and treated for correct speech utterance.

(ii) Identification of a speaker for secrecy.

(iii) Automatic speaker recognition.

1.1.1 MECHANISM OF SPEECH PRODUCTION

Figs 1.1 and 1.2 show the important features of the human mechanism that constitute the production of speech signals [1,4]. The vocal tract begins at the opening between the vocal cords, called the glottis, and ends at the lips. The vocal tract itself consists of the pharynx (the connection from the esophagus to the mouth) and the mouth. In an average adult male the total length of the vocal tract is approximately 17 cm. The cross sectional area is determined by the positions of the tongue, lips, jaw and velum. If the tract is divided into N equal sections, its area will vary at each section. We observe that the vocal tract is a circular tube with differing areas. The area can vary from zero, indicating complete closure, to about 20 cm^2. The nasal tract or nasal cavity begins at the velum and ends at the nostrils. When the velum is lowered, the nasal tract is acoustically coupled to the vocal tract to produce the nasal sounds of speech.

FIG 1.1 SPEECH PRODUCTION MECHANISM

FIG 1.2 SCHEMATIC REPRESENTATION OF VOCAL SYSTEM

The subglottal system, consisting of the lungs, bronchi and trachea, serves as the source of energy for the production of speech. Speech is an acoustic wave that is radiated from this system when air is expelled from the lungs and the resulting flow of air is perturbed by a constriction somewhere in the vocal tract [1,2].

To obtain the required quality of sound, the tension of the vocal cords must change. When the vocal cords, which are muscle fibres, are tightly closed, pressure builds up below them. When a certain threshold pressure is reached, which depends on the tension in the vocal cords and is sufficient to force them open, a narrow glottal opening is created between them. A jet of air rushes out with great kinetic energy and this, by Bernoulli's law, causes a fall of potential energy (and hence pressure) in the glottis. The vocal cords then close again, constricting the passage of air flow. This process is repeated 50-400 times in one second. Thus the vocal cords enter a state of sustained oscillation. The rate of opening and closing of the glottis is controlled by the air pressure in the lungs, the tension and stiffness of the vocal cords and the glottal opening under rest conditions.

1.1.2 Classification of speech

Speech can be classified into three distinct classes according to the mode of excitation of the vocal tract. Specifically,

• Voiced sounds are produced by exciting the vocal tract with quasi-periodic pulses of air flow caused by the opening and closing of the glottis. Examples are /u/, /d/, /i/, /e/.

• Fricative/unvoiced sounds are produced by forming a constriction somewhere in the vocal tract and forcing the air through the constriction so that turbulence is created, thereby producing a noise-like excitation. Examples are /f/, /θ/, /sh/ etc.

• Plosive sounds are produced by completely closing off the vocal tract, building up pressure behind the closure, and then abruptly releasing the pressure. Examples are /ts/ etc.

In each case the speech signal is produced by exciting the vocal tract system with a wide band excitation. The different sounds produced then depend on the constriction of the vocal tract, which is a tube having different area functions in different sections along its length. These areas vary from person to person. Also, the vocal tract changes shape relatively slowly with time, and it can thus be modelled as a slowly time-varying filter that imposes its frequency response properties on the spectrum of the excitation.

The utterance of a vowel is a voiced sound. The variation of the area at each section along the length of the vocal tract determines the resonant frequencies, or formants, of the speech signal. The dependence of cross sectional area upon the distance along the tract is called the area function of the vocal tract, and it is this area function which determines the shape of the vocal tract [5].

1.2 MAKING A MODEL OF SPEECH

Having briefly discussed the mechanism of speech production, it is useful to consider mathematical representations of speech production and to build a model of the speech signal.

1.2.1 Glottal Excitation

The glottal excitation for voiced sounds (in our case a sustained vowel) is the appropriate source for vocal tract excitation [16]. As brought out earlier, the rate of opening and closing of the glottis is controlled by the pressure in the lungs, the tension in the vocal cords and the area of the glottal opening under rest conditions. In other words, we can say that the glottal excitation is a sequence or train of impulses which are spaced by the desired fundamental period for a short interval segment of the speech signal. A digitized model of the glottal wave is shown in Fig 1.3. This wave shape has two parts, namely the open glottis, with a rising phase and a falling phase, and the closed glottis. The frequency response of such a wave shape for ideal values of n1 and n2 is shown in Fig 1.4.

FIG 1.3 IDEAL GLOTTAL WAVE FORM


Mathematically, each phase can be described as follows.

(a) Rising phase of open glottis:

    s_1[n] = \hat{S} \cdot 0.5\,[1 - \cos(\omega_R n)], \quad n = 0, 1, 2, \ldots, n_1        ..1.1

where \omega_R, the pulse rise frequency, equals \pi / n_1 and \hat{S} is the peak amplitude.

(b) Falling phase of open glottis:

    s_2[n] = \hat{S}\,[K \cos(\omega_R n - \pi) - K + 1], \quad n = n_1 + 1, n_1 + 2, \ldots, n_2        ..1.2

where K is the steepness factor, ideally 0.5 < K < 10,000. K can be calculated from

    n_2 - n_1 = \frac{1}{\omega_R} \cos^{-1}\!\left(\frac{K - 1}{K}\right)        ..1.3

(c) Closed glottis:

    s_3[n] = 0.

Here n is the sample number within the modelled glottal wave shape.

FIG 1.4 (a) IDEAL GLOTTAL WAVE AND (b) FREQUENCY RESPONSE OF (a)
[panel (b) vertical axis: 20 log10 |G(e^jw)|]

The values of n1 and n2 have been particularly useful, as we shall see later (Chapter 3), in obtaining and shaping the individual glottal wave.
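
The two-phase pulse of Eqs 1.1 to 1.3 can be evaluated numerically. The following is a minimal sketch, not the thesis's own program: the function name `glottal_pulse`, the rounding of n2 to a whole sample and the clipping of tiny negative values at closure are assumptions made for illustration.

```python
import numpy as np

def glottal_pulse(n1, K, amplitude=1.0, period=None):
    """One period of the model glottal pulse (hypothetical helper).

    n1        : sample at which the rising phase peaks
    K         : steepness factor, must exceed 0.5
    amplitude : peak value of the pulse
    period    : optional pitch period in samples; the closed phase is zero-padded
    """
    w_r = np.pi / n1                                   # pulse rise frequency
    # Closure instant n2 from Eq 1.3, rounded to a whole sample (assumption).
    n2 = n1 + int(round(np.arccos((K - 1.0) / K) / w_r))
    n_rise = np.arange(0, n1 + 1)
    n_fall = np.arange(n1 + 1, n2 + 1)
    rise = amplitude * 0.5 * (1.0 - np.cos(w_r * n_rise))                 # Eq 1.1
    fall = amplitude * (K * np.cos(w_r * n_fall - np.pi) - K + 1.0)       # Eq 1.2
    fall = np.maximum(fall, 0.0)                       # guard against rounding below zero
    pulse = np.concatenate([rise, fall])
    if period is not None:
        pad = max(0, period - len(pulse))              # closed glottis: s3[n] = 0
        pulse = np.concatenate([pulse, np.zeros(pad)])
    return pulse, n2

pulse, n2 = glottal_pulse(n1=40, K=1.0, period=100)
```

With K = 1 the falling phase lasts half as long as the rising phase, so a smaller K lengthens the closing slope and a larger K steepens it.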

1.2.2 Excitation of vocal tract

As brought out earlier, the excitation of the vocal tract is caused by the glottal excitation. It has also been pointed out that the vocal tract can be assumed to be a slowly varying filter that imposes its frequency response properties on the spectrum of the glottal excitation. Mathematically, therefore, G(z) and V(z) are superimposed on each other [7,21].

1.2.3 Terminal analog model

Having discussed the speech production mechanism in sufficient detail, a simple source filter model as shown in Fig 1.5 can now be made. The vocal tract is represented by a time-varying filter. The excitation source is a quasi-periodic impulse train or excitation generator. An amplitude control regulates the energy output. The various parameters for the vocal tract filter, voiced/unvoiced switch, pitch period and amplitude are regularly updated so as to keep track of variations in the speech waveform [1,2,4]. These parameters vary slowly.

FIG 1.5 (a) SOURCE FILTER MODEL AND (b) EQUIVALENT TERMINAL ANALOG MODEL

In a digital model of speech signals, however, the mode of excitation and the resonance properties of the linear system must change with time. To obtain a discrete-time model, it is instructive to represent the various systems involved in sound production by their transfer functions.

Rabiner and Schafer [1] showed that the vocal tract transfer function can be calculated from the average area functions or, equivalently, the reflection coefficients. An average male vocal tract was divided into N equal sections. The relation is

    V(z) = \frac{0.5\,(1 + r_G) \prod_{k=1}^{N} (1 + r_k)\, z^{-N/2}}{1 - \sum_{k=1}^{N} \alpha_k z^{-k}}        ..1.4

where r_k is the reflection coefficient of the k-th section,
      N is the number of sections of the vocal tract,
      r_G is the glottal wave reflection coefficient of the lossless tube, and
      \alpha_k are the coefficients of the denominator polynomial, which are determined by the reflection coefficients.

The reflection coefficients and area functions are related as

    r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}        ..1.5

where A_k and A_{k+1} are the average area functions of adjacent sections of the vocal tract.
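
As an illustration of Eqs 1.4 and 1.5, the sketch below converts a set of section areas into reflection coefficients and back, and builds the denominator of the lossless-tube transfer function. It is a sketch under stated assumptions: the function names are hypothetical, and the denominator uses the standard lossless-tube recursion D_k(z) = D_{k-1}(z) + r_k z^-k D_{k-1}(1/z) from Rabiner and Schafer, which assumes r_G = 1 and takes the last coefficient as the lip termination. The returned coefficients d[k] multiply z^-k, so \alpha_k = -d[k] in the notation of Eq 1.4.

```python
import numpy as np

def reflection_coefficients(areas):
    """Eq 1.5: r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for adjacent sections."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def areas_from_reflection(r, first_area):
    """Invert Eq 1.5: A_{k+1} = A_k (1 + r_k) / (1 - r_k)."""
    areas = [float(first_area)]
    for rk in r:
        areas.append(areas[-1] * (1.0 + rk) / (1.0 - rk))
    return np.array(areas)

def tube_denominator(r):
    """Denominator D(z) of the tube model as coefficients of z^0, z^-1, ..., z^-N.

    Assumes r_G = 1; r[-1] plays the role of the lip reflection coefficient.
    """
    d = np.array([1.0])
    for rk in r:
        d = np.append(d, 0.0)        # raise the order by one
        d = d + rk * d[::-1]         # add r_k z^-k D_{k-1}(1/z)
    return d

areas = [2.0, 4.0, 1.0]              # illustrative areas in cm^2, not measured data
r = reflection_coefficients(areas)
d = tube_denominator([0.5, 0.2])
```

For a two-section tube the recursion gives D(z) = 1 + r_1(1 + r_2) z^-1 + r_2 z^-2, which is the familiar two-tube result.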

The vocal tract transfer functions of certain vowels, along with their area functions and reflection coefficients, are shown in Figs 1.6 and 1.7.

1.2.4 Radiation

So far we have considered the transfer function V(z) of a vocal tract, which assumed a configuration in which volume velocity changes at the lips are accompanied by corresponding pressure changes at the glottis [1]. In the electric analogy, the medium of air is considered to be a short circuit because it presents no impedance. The acoustic counterpart of a short circuit is as difficult to achieve as an electrical short circuit, because it requires a configuration in which volume velocity changes occur at the lips without corresponding changes in pressure. However, a reasonable model, as depicted in Fig 1.8, can be made which shows the lip opening as an orifice in a sphere. At low frequencies the opening can be considered as a radiating surface, with the sound waves being diffracted by the spherical baffle that represents the head. To determine the conditions at the lips, all that is needed is the relationship between pressure and volume velocity at the radiating surface. This is very complicated for the configuration of Fig 1.8(a). However, if the lip opening (radiating surface) is small compared to the size of the sphere, a reasonable approximation assumes that the radiating surface is set in a plane baffle of infinite extent, as shown in Fig 1.8(b). It can be shown that the sinusoidal steady state relation between the complex amplitudes of pressure and volume velocity at the lips is through a radiation impedance or radiation load. Further, the volume velocity is converted into sound pressure at a microphone placed in the distant field through differentiation of the volume velocity, so a genuine and comprehensive model can only be made if we consider the radiation load. Hence, if we wish to obtain a model for pressure at the lips, as is the case now, the effects of the radiation load must be considered. The glottal volume velocity and sound pressure for a particular vowel are depicted in Fig 1.9 [16].

FIG 1.6 (a) AREA FUNCTION FOR 10 SECTION LOSSLESS TUBE TERMINATED WITH REFLECTIONLESS SECTION OF AREA 10 CM^2; (b) REFLECTION COEFFICIENTS FOR 10 SECTION TUBE; (c) FREQUENCY RESPONSE OF 10 SECTION TUBE; SOLID CURVES CORRESPOND TO SHORT CIRCUIT TERMINATION. VOWEL /a/.

FIG 1.7 (a) AREA FUNCTIONS OF VOCAL TUBE AND (b) FREQUENCY RESPONSE OF (a) FOR VOWEL /u/.
[axes: area A(x) in cm^2 against distance x in cm from glottis to lips; 20 log10 |U_L/U_G| against frequency in Hz]

    FORMANT   FREQUENCY   BANDWIDTH
    1ST       232.0       60.7
    2ND       596.5       57.2
    3RD       2394.9      65.9
    4TH       3849.7      42.5

1.2.5 Losses in vocal tract

So far the model we have considered assumed no energy loss in the vocal tube. In reality, energy will be lost as a result of

a) viscous friction between the air and the walls of the tube,

b) heat conduction through the walls of the tube, and

c) vibration of the tube walls.

The effect of wall vibration is caused by variation of the air pressure inside the tract, which causes the walls to experience a varying force. Thus, if the walls are elastic, the cross sectional area of the tube will change depending upon the pressure [1,2]. This effect results in the addition of a wall admittance Y in Eq 1.4 and consequent changes in the reflection coefficients, which were earlier real quantities.



FIG 1.8 (a) RADIATION FROM A SPHERICAL BAFFLE; (b) RADIATION FROM AN INFINITE PLANE BAFFLE.

FIG 1.9 GLOTTAL VOLUME VELOCITY AND SOUND PRESSURE AT THE MOUTH FOR VOWEL /a/.


The effects of thermal conduction and viscous friction at the wall are much less pronounced for frequencies below 3-4 kHz, whereas the wall loss is more pronounced at these frequencies. The effects of wall vibration therefore need to be considered [6]. The model of the speech mechanism, however, needs no alteration. Details are discussed in Chapter 5.

1.2.6 A complete model

A complete model can now be put together to formulate the problem and achieve the declared objective of the thesis. Such a model is shown in Fig 1.10. It is convenient to combine the impulse train, glottal pulse, vocal tract and radiation components and represent them as a single transfer function Sr(z):

    Sr(z) = P(z).G(z).V(z).R(z), or equivalently,        ..1.6

    Sr(z) = S(z).R(z).        ..1.7

Here Sr(z) is the transfer function of the complete model, which takes the radiation load into account. S(z) is the transfer function of the model which does not include the effect of radiation. G(z).V(z) can itself be represented by a single transfer function H(z). Thus the complete model can be represented as

    Sr(z) = P(z).H(z).R(z).        ..1.8

1.3 FORMULATION OF PROBLEM

After this brief introduction to the speech production mechanism, the thesis now turns to its objective. It is important that the problem be formulated in a simple way so that it can be tackled effectively.

1.3.1 Objective

The objective of the thesis is to separate the vocal tract transfer function and the glottal wave transfer function by successive iteration.

1.3.2 Removal of radiation load component

It may be recalled that the radiation component is a low impedance load which terminates the vocal tract; the volume velocity of air flow at the lips (and the nose) is converted into sound pressure in the distant field, which is approximately the derivative of the volume velocity at the lips. What we wished to achieve was a model of speech for pressure at the lips, so as to retain the original identity of the speech signal. To reconstruct the speech signal at the lips, the radiation load component has to be removed from the speech model. A filter has to be applied whose transfer function reverses the influence of the radiation component. Referring to Eq 1.7, we can write the radiation load transfer function as

    R(z) = Sr(z) / S(z).        ..1.9

FIG 1.10 GENERAL DISCRETE TIME MODEL FOR SPEECH PRODUCTION

FIG 1.11 PRINCIPLE OF WORKING OF ANALOG INTEGRATOR

In a first approximation, which is valid for lower frequencies where the wavelength is large compared to the diameter of the mouth opening, this conversion involves differentiation, causing a zero at zero frequency. In the inverse filter this zero is cancelled by an integrator component, i.e. by a first order recursive filter with a pole at or near z = 1. Wolfgang Hess [12] suggested the radiation load as

    R(z) = 1 - 0.995 z^{-1}, or        ..1.10

    S(z) = \frac{S_r(z)}{1 - 0.995 z^{-1}},        ..1.11

but we used a conventional analog integrator whose performance and results differed only marginally from those of the filter suggested by him. The speech signal outputs from the analog integrator and from the digital inverse filter for some of the vowels are shown in Figs 1.12 and 1.13 for comparison. This integrator has a switching device to initiate and terminate the period of integration.
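
For comparison with the analog integrator, the digital inverse filter of Eqs 1.10 and 1.11 can be sketched as a first order difference equation. This is an illustration only: the function names are hypothetical, zero initial conditions are assumed, and the test signal is synthetic rather than recorded speech.

```python
import numpy as np

def radiation(x, a=0.995):
    """Apply R(z) = 1 - a z^-1, i.e. y[n] = x[n] - a x[n-1]."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= a * x[:-1]
    return y

def leaky_integrator(y, a=0.995):
    """Apply 1 / R(z), i.e. s[n] = y[n] + a s[n-1], starting from rest."""
    s = np.zeros(len(y))
    prev = 0.0
    for n, yn in enumerate(y):
        prev = yn + a * prev
        s[n] = prev
    return s

# Synthetic stand-in for S: a slow sinusoid.
s_true = np.sin(2.0 * np.pi * 0.01 * np.arange(200))
s_radiated = radiation(s_true)             # Sr = S.R
s_recovered = leaky_integrator(s_radiated) # S = Sr / R
```

Placing the pole at 0.995 rather than exactly at z = 1 keeps the integrator stable when the input has a small DC offset.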

Fig 1.11 illustrates the functioning of this integrator [26]. In reset mode, when switch S1 is closed, the initial conditions are established by placing an initial charge on the capacitor. This allows the output voltage to rise to the negative of V_IC. If switch S1 is then opened and switch S2 is closed, the circuit begins the integration of the input signal e_i, beginning at the value -V_IC. An astable multivibrator coupled with a CD 4066 has been used as the switching device.

FIG 1.12 (a) ORIGINAL SPEECH SIGNAL FOR VOWEL /aw/; (b) INTEGRATED SPEECH SIGNAL OF (a) USING ANALOG INTEGRATOR.

FIG 1.13 (a) INTEGRATED SPEECH SIGNAL FOR VOWEL /aw/ USING DIGITAL FILTER; (b) INTEGRATED SPEECH SIGNAL FOR VOWEL /ae/ USING ANALOG INTEGRATOR.

A band pass filter (BPF) passing 50-4000 Hz has been incorporated before the integrator. The speech signal was passed through this BPF to the integrator. The integrated signal was then fed to the FFT analyser for further processing.

A block diagram of the hardware used and its detailed circuit are shown in Fig 1.14.

The speech signal S(z), after the removal of the radiation load component R(z), is the product of P(z) and H(z). The homomorphic deconvolution technique discussed in Chapter 2 was applied to S(z) through the FFT analyser to recover H(z) from it, and this H(z) could then be transferred to the PC through the GPIB interface. The separation of G(z) from V(z) in H(z) posed difficulties, as their spectra were superimposed on each other.

Chapter 3 contains the successive iteration method employed to find the individual area vocal tract transfer function Vi(z). For an average male and a given vowel, the vocal tract transfer function Va(z) is given by Eq 1.4. Division of H(z) by this Va(z) resulted in the average area glottal wave transfer function Ga(z). Fig 1.15 shows the wave shape of ga[n] computed by the IDFT of Ga(z). As can be seen, the waveform has high frequency components due to the mismatch between the individual and average area functions.

FIG 1.14 (a) BLOCK DIAGRAM OF HARDWARE; (b) DETAILED CIRCUIT DIAGRAM OF (a).

FIG 1.15 GLOTTAL WAVE SHAPE ga[n] WITH HIGH FREQUENCY COMPONENTS SUPERIMPOSED ON IT.

The approximation of this glottal wave to the model glottal wave shape can be achieved either through low pass filtering of ga[n] or by finding n1 and n2 through a least mean square fit between the individual and model glottal wave shapes of Fig 1.4. The transfer function of the individual glottal wave was then found as Gi(z). The resultant waveforms of gi[n] are shown in Fig 1.16. The individual area vocal tract transfer function Vi(z) was found by simple division of H(z) by Gi(z). This was the first step of the iteration. The step is then repeated, taking this Vi(z) as a new Va(z) to find another Gi(z). The method is repeated several times until the closest approximation to the ideal glottal wave shape is attained. The Vi(z) determined at the end of this iteration is taken as the final individual Vi(z).
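
The division step at the heart of each iteration can be sketched as follows. This covers only Ga(z) = H(z)/Va(z) followed by an IDFT; the fitting of the model pulse and the repetition of the loop are not shown. The helper name `divide_spectra`, the small magnitude floor and the synthetic glottal pulse and all-pole spectrum are assumptions made for illustration.

```python
import numpy as np

def divide_spectra(h_spec, v_spec, floor=1e-8):
    """Return h_spec / v_spec, guarding bins where |v_spec| is nearly zero."""
    v_safe = np.where(np.abs(v_spec) < floor, floor, v_spec)
    return h_spec / v_safe

n_fft = 512

# Stand-in glottal pulse: the rising phase of Eq 1.1 only (synthetic data).
g_true = np.zeros(n_fft)
g_true[:41] = 0.5 * (1.0 - np.cos(np.pi * np.arange(41) / 40))

# Stand-in all-pole vocal tract with a single resonance (made-up coefficients).
denominator = np.fft.fft([1.0, -1.2, 0.8], n_fft)
v_spec = 1.0 / denominator

h_spec = np.fft.fft(g_true) * v_spec       # H = G.V on the DFT grid
g_spec = divide_spectra(h_spec, v_spec)    # Ga = H / Va
g_est = np.fft.ifft(g_spec).real           # ga[n] by IDFT
```

When the assumed vocal tract spectrum matches the true one exactly, as here, the pulse is recovered without error; a mismatch between the two is what produces the high frequency ripple described for Fig 1.15.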

Chapter 4 uses the relationship between linear predictor coefficients, PARCORs and average area functions so that the individual vocal tract area functions can be calculated from this Vi(z).

Chapter 5 discusses the losses that occur in the vocal tract and how they have been taken into account in Va(z) of Eq 1.4.

The conclusion of the thesis is given in Chapter 6.




CHAPTER 2


HOMOMORPHIC DECONVOLUTION

The fundamentals of homomorphic deconvolution are briefly given below. The details may be found in Oppenheim and Schafer [5].

As described in Chapter 1, there are three basic classes of sounds corresponding to different forms of excitation of the vocal tract, namely voiced sounds, fricative/unvoiced sounds and plosive sounds. In each case the speech signal is produced by exciting the vocal tract system with a wideband excitation. The vocal tract changes shape rather slowly with time, and thus can be modelled as a slowly time-varying filter that imposes its frequency response properties on the spectrum of the excitation [1,3].

Fig 1.5 depicted a discrete-time model in which the samples of speech were assumed to be the output of a time-varying discrete-time system that models the resonances of the vocal tract system. Since the vocal tract changes shape rather slowly in continuous speech, it is reasonable to assume that the discrete-time system in the model has fixed properties over a time interval of the order of 10 ms. (In the present case the interval is taken as 40 ms or more, because the sound is a sustained vowel with unmodulated amplitude.) Thus the discrete-time system may be characterized in each such time interval by an impulse response, a frequency response or a set of coefficients for an IIR system. Specifically, for voiced sounds, the transfer function of the digital filter includes a vocal tract component represented by

    V(z) = \frac{A}{\prod_{k=1}^{p/2} (1 - c_k z^{-1})(1 - c_k^{*} z^{-1})}        ..2.1

where p is the number of poles of V(z), and c_k and c_k^{*} are the complex natural frequencies of the vocal tract.

The poles of V(z) correspond to the formants. Thus in Fig 2.1 the system function of the digital filter is

    H(z) = G(z).V(z)        ..2.2

This filter is excited by a train of impulses p[n] in which the spacing between impulses corresponds to the fundamental (or pitch) period of the voice [22,24]. An amplitude control regulates the intensity of the input to the digital filter.

FIG 2.1 DISCRETE TIME MODEL FOR VOICED SPEECH PRODUCTION

Homomorphic deconvolution can be applied to the estimation of the parameters of the speech model if we assume that the model is valid over a short time interval [22], so that a short segment of length L samples of the sampled speech is thought of as a convolution

    s[n] = g[n] * v[n] * p[n], \quad 0 \le n \le L - 1        ..2.3

where v[n]*g[n] is the impulse response of the vocal tract system and p[n] is periodic. The model of Eq 2.3 is not valid at the edges of the interval, because of the pulses that occur before the beginning of the analysis interval and the pulses that end after the end of the interval. Therefore, to mitigate the effect of discontinuities of the model at the ends of the interval, the speech signal s[n] can be multiplied by a window w[n] that tapers smoothly to zero at both ends. Thus the input to the homomorphic deconvolution system is

    x[n] = w[n]\, s[n]        ..2.4

where w[n] is nonzero only for 0 \le n \le N - 1, that is, w[n] = 0 for n < 0 and for n > N - 1.

In the case of voiced speech, if w[n] varies very slowly with respect to the variations of v[n]*g[n] [23], the analysis is greatly simplified if we assume that

    x[n] = v[n] * g[n] * p_w[n]        ..2.5

where p_w[n] = w[n] p[n].

Even if this assumption is not made, a detailed analysis leads to the same conclusion [13].

Let us examine the contribution to the complex cepstrum of each component of Eq 2.5. It is reasonable to assume that over the short time interval of the window, p[n] is a train of equally spaced impulses of the form

    p[n] = \sum_{k=0}^{M-1} \delta[n - kN_0]        ..2.6

where the pitch period is N_0 samples and M periods are spanned by the window. From Eq 2.6,

    p_w[n] = \sum_{k=0}^{M-1} w[kN_0]\, \delta[n - kN_0]        ..2.7

To obtain \hat{p}_w[n], we define a sequence

    w_{N_0}[k] = w[kN_0] for k = 0, 1, 2, \ldots, M-1, and 0 otherwise,        ..2.8

in terms of which the Fourier transform of p_w[n] is

    P_w(e^{j\omega}) = \sum_{k=0}^{M-1} w[kN_0]\, e^{-j\omega k N_0} = W_{N_0}(e^{j\omega N_0})        ..2.9

Thus P_w(e^{j\omega}) and \hat{P}_w(e^{j\omega}) are both periodic with period 2\pi/N_0, and the complex cepstrum of p_w[n] is

    \hat{p}_w[n] = \hat{w}_{N_0}[n/N_0] for n = 0, \pm N_0, \pm 2N_0, \ldots, and 0 otherwise,        ..2.10

where \hat{w}_{N_0}[n] is the complex cepstrum of w_{N_0}[n].

The periodicity of the complex logarithm resulting from the periodicity of the voiced speech signal is manifest in the complex cepstrum as impulses spaced at integer multiples of N_0 samples (the pitch period). If the sequence w_{N_0}[n] is minimum phase, then \hat{p}_w[n] will be zero for n < 0. Otherwise, \hat{p}_w[n] will have pulses spaced at intervals of N_0 samples for both positive and negative values of n. In either case, the contribution of \hat{p}_w[n] to \hat{x}[n] will be found in the interval |n| \ge N_0.

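The complex cepstrum of v[n] can be obtained from the complex logarithm of V(z):

    \hat{V}(z) = \log[A] - \sum_{k=1}^{p/2} \{ \log[1 - c_k z^{-1}] + \log[1 - c_k^{*} z^{-1}] \}        ..2.11

From this expression it is easily seen that

    \hat{v}[n] = 0 for n < 0,
    \hat{v}[n] = \log[A] for n = 0,        ..2.12
    \hat{v}[n] = \sum_{k=1}^{p/2} \frac{c_k^{n} + (c_k^{*})^{n}}{n} for n > 0,

or, if c_k = |c_k| e^{j\theta_k},

    \hat{v}[n] = 2 \sum_{k=1}^{p/2} \frac{|c_k|^{n}}{n} \cos(\theta_k n), \quad n > 0        ..2.13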
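
The closed form of Eqs 2.12 and 2.13 can be checked numerically against a cepstrum computed from the spectrum. The sketch below is illustrative: the function names and pole values are made up, and the numerical route relies on the fact that for a minimum-phase sequence the complex cepstrum for n > 0 is twice the real cepstrum.

```python
import numpy as np

def allpole_cepstrum(poles, gain, n_max):
    """Closed-form cepstrum of Eqs 2.12 and 2.13 for n = 0 .. n_max.

    `poles` holds one pole c_k from each complex-conjugate pair.
    """
    n = np.arange(1, n_max + 1)
    v_hat = np.zeros(n_max + 1)
    v_hat[0] = np.log(gain)
    for c in poles:
        v_hat[1:] += 2.0 * np.abs(c) ** n * np.cos(np.angle(c) * n) / n
    return v_hat

def numeric_cepstrum(poles, gain, n_max, n_fft=4096):
    """Same quantity from log|V| on the unit circle (minimum-phase V assumed)."""
    z_inv = np.exp(-2j * np.pi * np.arange(n_fft) / n_fft)
    den = np.ones(n_fft, dtype=complex)
    for c in poles:
        den *= (1.0 - c * z_inv) * (1.0 - np.conj(c) * z_inv)
    real_cep = np.fft.ifft(np.log(np.abs(gain / den))).real
    v_hat = 2.0 * real_cep[: n_max + 1]    # causal part is twice the even part
    v_hat[0] = real_cep[0]                 # except at n = 0
    return v_hat

# Two made-up conjugate pole pairs inside the unit circle.
poles = [0.9 * np.exp(0.3j * np.pi), 0.8 * np.exp(0.7j * np.pi)]
closed_form = allpole_cepstrum(poles, gain=2.0, n_max=40)
numeric = numeric_cepstrum(poles, gain=2.0, n_max=40)
```

The 1/n factor together with |c_k|^n is what makes the vocal tract contribution die away quickly with n, which the separation from the pitch impulses depends on.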

The glottal pulse g[n] is of finite duration and is generally assumed to be a non-minimum phase sequence, as in

    g[n] = g_{min}[n] * g_{max}[n]        ..2.14

The contribution of g[n] to the complex cepstrum \hat{x}[n] is

    \hat{g}[n] = \hat{g}_{min}[n] for n \ge 0, and \hat{g}_{max}[n] for n < 0,        ..2.15

where, from our previous discussion, we expect that the primary contribution of \hat{g}[n] to \hat{x}[n] would be in the region around n = 0.

In general, the components of the complex cepstrum $\hat{v}[n]$ and $\hat{g}[n]$ decay rather rapidly, so that for reasonably large values of $N_0$ the vocal tract and glottal pulse contributions do not overlap $\hat{p}_w[n]$; that is, the peaks of $\hat{p}_w[n]$ stand out clearly from $\hat{v}[n]$. In other words, in the complex logarithm, the vocal tract components are slowly varying and the excitation components are rapidly varying. This is illustrated by an example.

2.1 AN EXAMPLE OF HOMOMORPHIC DECONVOLUTION OF SPEECH 

Fig 2.2(a) shows a segment of speech weighted by a Hamming window [23,3]. The complex logarithm (magnitude and unwrapped phase) of the DFT of the signal in Fig 2.2(a) is shown in Fig 2.2(b). Note the rapidly varying, almost periodic components due to $p_w[n]$ and the slowly varying components due to $v[n]$ and $g[n]$. These properties are manifest in the complex cepstrum of Fig 2.2(c) in the form of impulses at multiples of approximately 8 ms (the period of the input speech segment) due to $\hat{p}_w[n]$, and in the samples in the region $|nT| < 5$ ms, which we attribute to $\hat{v}[n]$ and $\hat{g}[n]$.

For speech sampled at 10,000 samples/sec, the pitch period $N_0$ will range from about 25 samples for a high pitched voice up to 150 samples for a low pitched voice.

FIG 2.2(a) SEGMENT OF SPEECH WEIGHTED BY HAMMING WINDOW (input, 200 samples; time in ms);
(b) COMPLEX LOGARITHM OF DFT OF SIGNAL IN (a) (log magnitude and phase in rad against frequency in Hz);
(c) COMPLEX CEPSTRUM OF PART (a).

As previously explained, a frequency-invariant filter can be used to separate the components of the convolutional model of speech. Low pass filtering of the complex logarithm can be used to recover an approximation to $g[n]*v[n]$, and high pass filtering to recover $p_w[n]$. Fig 2.3(a) shows 256 samples of a vowel sound. This segment was multiplied by a Hamming window and the complex cepstrum was computed using the DFT. The complex cepstrum is shown in Fig 2.3(b). Fig 2.3(c) is an approximation of $p_w[n]$ obtained by applying a symmetrical high pass frequency-invariant filter to the complex cepstrum. Fig 2.3(d) shows the approximation to $g[n]*v[n]$ obtained by using a low pass frequency-invariant filter.

Finally, to illustrate the validity of the convolutional model, Fig 2.3(e) shows the result of convolving the sequence of Fig 2.3(d) with a train of equal amplitude impulses occurring at the locations of the peaks in Fig 2.3(c). As we see by comparing Figs 2.3(a) and 2.3(e), the reconstructed waveform is very close to the original.
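The separation described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the procedure used in the thesis: it uses the real cepstrum (log magnitude only) instead of the complex cepstrum, so no phase unwrapping is needed but only the spectral envelope, not the waveform, is recovered. The synthetic vowel, the 40 sample cutoff and all function names are hypothetical.

```python
import numpy as np

def real_cepstrum(x, n_fft):
    """Inverse DFT of the log magnitude spectrum (phase is discarded)."""
    log_mag = np.log(np.abs(np.fft.fft(x, n_fft)) + 1e-12)
    return np.fft.ifft(log_mag).real

def lifter(cepstrum, cutoff, keep):
    """Symmetric frequency-invariant filter on the cepstrum.

    keep="low" retains |n| < cutoff (vocal tract and glottal part),
    keep="high" retains the rest (excitation part).
    """
    n = np.arange(len(cepstrum))
    quefrency = np.minimum(n, len(cepstrum) - n)  # |n| for a circular index
    mask = quefrency < cutoff if keep == "low" else quefrency >= cutoff
    return cepstrum * mask

# Synthetic "vowel": impulse train through a single two-pole resonance.
FS = 10_000                      # samples/sec, as in the example above
N0 = 80                          # assumed pitch period in samples (8 ms)
excitation = np.zeros(5 * N0)
excitation[::N0] = 1.0
radius, theta = 0.95, 2 * np.pi * 700 / FS   # hypothetical 700 Hz resonance
n = np.arange(200)
h = radius ** n * np.sin((n + 1) * theta) / np.sin(theta)

speech = np.convolve(excitation, h)[: len(excitation)]
x = speech * np.hamming(len(speech))

c = real_cepstrum(x, 1024)
low_time = lifter(c, 40, "low")      # slowly varying (envelope) part
high_time = lifter(c, 40, "high")    # rapidly varying (pitch) part
pitch_estimate = 40 + int(np.argmax(high_time[40:512]))
log_envelope = np.fft.fft(low_time).real   # smoothed log magnitude spectrum
```

The largest peak of the high-time part falls at the pitch period, while the low-time part transforms back to a smooth log spectrum that follows the resonance.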

2.2 HOMOMORPHIC DECONVOLUTION AS APPLIED TO ACTUAL SPEECH

The previous discussion has shown that homomorphic deconvolution can be successfully applied for recovery of G(z).V(z) from P(z).H(z). An FFT analyser in the laboratory was made use of for this purpose and the vowel uttered was /aw/. The pitch period was 6.7 ms, as shown by the markings in Fig 2.4(a). The sampling frequency was 80 kHz. 536 samples were obtained from the relation

$$ \text{number of samples} = \frac{\text{pitch period}}{\text{sampling interval}} $$

or

$$ 6.7 \times 10^{-3} \times 80 \times 10^{3} = 536 $$

FIG 2.3(a) SEGMENT OF VOWEL SOUND; (b) COMPLEX CEPSTRUM; (c) RECOVERED EXCITATION $p_w[n]$; (d) RECOVERED $g[n]*v[n]$; (e) SEQUENCE OF (d) CONVOLVED WITH IMPULSE TRAIN AT THE PITCH PERIOD AS MEASURED (time in ms).

Fig 2.4(b) shows the recovered complex cepstrum of the sustained vowel. This H(z) was what we wanted to recover from P(z).H(z). The H(z) so recovered was transferred to a PC through a GPIB interface for further separation of $V_i(z)$ and subsequent determination of the individual vocal tract area functions.
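The sample count quoted above follows directly from the pitch period and the sampling frequency. A small sketch of the arithmetic (the function name is illustrative):

```python
def samples_per_pitch_period(pitch_period_s, sampling_rate_hz):
    """Number of samples spanned by one pitch period."""
    return round(pitch_period_s * sampling_rate_hz)

samples_80k = samples_per_pitch_period(6.7e-3, 80e3)   # 536 samples at 80 kHz
samples_20k = samples_per_pitch_period(6.7e-3, 20e3)   # 134 samples at 20 kHz
```

The second value corresponds to the same pitch period at one quarter of the sampling frequency.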



FIG 2.4(a) PITCH PERIOD OF SEGMENT OF ACTUALLY UTTERED VOWEL /aw/

FIG 2.4(b) (vertical axis: 20 log10(exp(jw)), from -40.00 to 60.00)



CHAPTER 3 


SUCCESSIVE ITERATION 


3.1 GENERAL DISCUSSION

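The general expression for the vocal tract transfer function as given by Eq. 1.4 was [1]

$$ V_a(z) = \frac{0.5\,(1 + r_G) \prod_{k=1}^{N} (1 + r_k)\; z^{-N/2}}{D(z)} $$ ..3.1

where

$$ D(z) = 1 - \sum_{k=1}^{N} \alpha_k z^{-k} $$ ..3.2

is a polynomial. It assumes a lossless tube with each section of equal length. $D(z)$ can also be expressed as a polynomial in $z^{-1}$ given by the matrix product

$$ D(z) = [\,1, \; -r_G\,] \begin{bmatrix} 1 & -r_1 \\ -r_1 z^{-1} & z^{-1} \end{bmatrix} \cdots \begin{bmatrix} 1 & -r_N \\ -r_N z^{-1} & z^{-1} \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} $$ ..3.3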





In other words, the transfer function has a delay corresponding to the number of sections of the model and has no zeros, only poles, which define the resonances or formants of the model. In the usual case $r_G = 1$ (infinite impedance at the glottis). The polynomial $D(z)$ can be found using a recursion formula that can be derived from Eq. 3.3. The desired recursion formula becomes evident after evaluating the first few matrix products. Let us define



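$$ P_1 = [\,1, \; -1\,] \begin{bmatrix} 1 & -r_1 \\ -r_1 z^{-1} & z^{-1} \end{bmatrix} = [\,(1 + r_1 z^{-1}), \; -(r_1 + z^{-1})\,] $$ ..3.4

If we define

$$ D_1(z) = 1 + r_1 z^{-1} $$ ..3.5

it can be easily shown that

$$ P_1 = [\,D_1(z), \; -z^{-1} D_1(z^{-1})\,] $$ ..3.6

Similarly, the row matrix $P_2$ can be defined as

$$ P_2 = P_1 \begin{bmatrix} 1 & -r_2 \\ -r_2 z^{-1} & z^{-1} \end{bmatrix} $$ ..3.7

If the indicated multiplication is carried out, it is easily shown that

$$ P_2 = [\,D_2(z), \; -z^{-2} D_2(z^{-1})\,] $$ ..3.8

where

$$ D_2(z) = D_1(z) + r_2 z^{-2} D_1(z^{-1}) $$ ..3.9

By induction it can be seen that

$$ P_k = P_{k-1} \begin{bmatrix} 1 & -r_k \\ -r_k z^{-1} & z^{-1} \end{bmatrix} $$ ..3.10

$$ P_k = [\,D_k(z), \; -z^{-k} D_k(z^{-1})\,] $$ ..3.11

where

$$ D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}) $$ ..3.12

Finally, the desired polynomial is

$$ D(z) = P_N \begin{bmatrix} 1 \\ 0 \end{bmatrix} = D_N(z) $$ ..3.13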


Thus we can see that it is not necessary to carry out all the matrix multiplications; we can simply evaluate the recursion

$$ D_0(z) = 1 $$ ..3.14

$$ D_k(z) = D_{k-1}(z) + r_k z^{-k} D_{k-1}(z^{-1}), \quad k = 1, 2, \ldots, N $$ ..3.15

$$ D(z) = D_N(z) $$ ..3.16


The effectiveness of lossless tube model can be 
demonstrated by computing the transfer function for the 
area function data given by Fig. 1.6(a). 
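The recursion for $D(z)$ is short enough to write out directly. The sketch below assumes NumPy and stores a polynomial as its coefficients of $z^0, z^{-1}, z^{-2}, \ldots$; the function names are illustrative. In that representation $z^{-k} D_{k-1}(z^{-1})$ is simply the zero-padded coefficient list reversed. A second function evaluates the matrix product of the lossless tube model for $r_G = 1$ at a single point $z$, as an independent check.

```python
import numpy as np

def tube_denominator(reflection_coeffs):
    """Coefficients of D(z) in powers of z^-1, for r_G = 1.

    Implements D_0 = 1, D_k(z) = D_{k-1}(z) + r_k z^-k D_{k-1}(z^-1).
    """
    d = np.array([1.0])
    for r_k in reflection_coeffs:
        padded = np.append(d, 0.0)       # D_{k-1} viewed as degree k
        d = padded + r_k * padded[::-1]  # reversed list is z^-k D_{k-1}(1/z)
    return d

def tube_denominator_matrix(reflection_coeffs, z):
    """D(z) at one point z from the row-vector matrix product, r_G = 1."""
    row = np.array([1.0, -1.0], dtype=complex)
    for r_k in reflection_coeffs:
        row = row @ np.array([[1.0, -r_k], [-r_k / z, 1.0 / z]])
    return row[0]

# Illustrative reflection coefficients (not the thesis data).
r_example = [0.5, 0.2, -0.3]
d_example = tube_denominator(r_example)
```

For two sections with $r_1 = 0.5$ and $r_2 = 0.2$ the recursion gives $D(z) = 1 + 0.6 z^{-1} + 0.2 z^{-2}$, which matches a hand expansion of the recursion.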

3.1.1 The Procedure 

To determine the individual vocal tract transfer function $V_i(z)$ of our speech signal, the reflection coefficients $r_k$ were first computed from the average area functions given in Fig 1.6, using Eq 1.5. The equation treats the radiation load as a tube of area $A_{N+1}$ which has no reflected wave. The value $A_{N+1}$ is chosen to give the reflection coefficient at the output. This was the only source of loss in the system, $r_G$ being taken as 1 for infinite glottal impedance. Thus it is to be expected that $A_{N+1}$ will control the bandwidth of the resonances of $V(z)$. The number of sections of the vocal tube was taken as 10. A 512 point DFT was computed from relation 3.1 using Pascal language programming to obtain $V_a(z)$, the average area vocal tract transfer function. The frequency response was the same as that depicted in Fig 1.6(c). The frequency response from our computation is shown in Fig 3.1.
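The computation just described can be sketched as follows. This is an illustration assuming NumPy, not the thesis's Pascal program: the area values are hypothetical placeholders (the data of Fig 1.6 are not reproduced here), $r_G = 1$ is assumed, and the reflection coefficient is taken as $r_k = (A_{k+1} - A_k)/(A_{k+1} + A_k)$.

```python
import numpy as np

def reflection_coefficients(areas):
    """r_k = (A_{k+1} - A_k) / (A_{k+1} + A_k) for areas A_1 .. A_{N+1}."""
    a = np.asarray(areas, dtype=float)
    return (a[1:] - a[:-1]) / (a[1:] + a[:-1])

def denominator(reflection_coeffs):
    """D(z) coefficients from the recursion 3.14 to 3.16 (r_G = 1)."""
    d = np.array([1.0])
    for r_k in reflection_coeffs:
        padded = np.append(d, 0.0)
        d = padded + r_k * padded[::-1]
    return d

def tube_response(areas, n_fft=512):
    """Samples of V(z) of Eq 3.1 on the unit circle, with r_G = 1."""
    r = reflection_coefficients(areas)
    omega = 2 * np.pi * np.arange(n_fft) / n_fft
    numerator = np.prod(1.0 + r) * np.exp(-1j * omega * len(r) / 2)
    return numerator / np.fft.fft(denominator(r), n_fft)

# Hypothetical 10 section tube: 10 areas plus the termination area A_11 (cm^2).
example_areas = [2.6, 8.0, 10.5, 10.5, 8.0, 4.8, 1.0, 0.65, 2.6, 5.0, 30.0]
V_a = tube_response(example_areas)
magnitude_db = 20 * np.log10(np.abs(V_a))
```

With $r_G = 1$ the gain at zero frequency is exactly one, since $D(1) = \prod_k (1 + r_k)$; making the termination area larger pushes the last reflection coefficient towards 1 and narrows the resonance bandwidths.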

The $V_a(z)$ so calculated was used for simple mathematical division of the $H(z)$ of our speech signal. This resulted in the emergence of the average area glottal wave transfer function $G_a(z)$. The IDFT of this $G_a(z)$ provided us with the digitized $g_a[n]$. This $g_a[n]$ has very high frequency components superimposed on it, as has already been shown in Fig 1.15. Our aim was to approximate this wave shape to the synthetic (or modelled) glottal wave shape of Fig 1.4 and to remove these high frequency components, which owed their presence to the mismatch between the average area and individual area functions. To arrive at such an approximation, it was imperative that the values of $n_1$ and $n_2$ of our individual $g_i[n]$ be calculated. These values were determined by finding the least mean squared difference between the modelled $g_m[n]$ and $g_a[n]$ using the relation

$$ E = \sum_{i=0}^{133} \left( g_m[i] - g_a[i] \right)^2 $$ ..3.17

A range of $n_1$ and $n_2$ values was chosen for $g_m[n]$. The values of $n_1$ and $n_2$ which gave the least mean squared value $E$ were chosen to represent the synthetic $g_i[n]$ by the three sets of equations 1.1, 1.2 and 1.3 given in chapter 1 [27]. A point to note here is that 134 points have been chosen to represent one glottal wave cycle. This value, 134, is one quarter of the 536 samples of our original speech data, which was decimated in time. This decimation converts the original sampling frequency of 80 kHz to 20 kHz. The 512 point DFT computation of $g_i[n]$ using Pascal programming produced $G_i(z)$. The transfer function $H(z)$, when divided by this $G_i(z)$, resulted in the individual vocal tract transfer function $V_i(z)$.
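The search over the two pulse parameters can be sketched as a plain grid search. Equations 1.1 to 1.3 are not reproduced in this chapter, so the sketch assumes the common Rosenberg-type pulse (raised cosine opening over $n_1$ samples, quarter cosine closing over $n_2$ samples); the parameter ranges, the ripple added to the test signal and the function names are all illustrative.

```python
import numpy as np

def model_pulse(n1, n2, length):
    """Assumed Rosenberg-type glottal pulse with opening n1 and closing n2."""
    n = np.arange(length)
    g = np.zeros(length)
    rising = n <= n1
    falling = (n > n1) & (n <= n1 + n2)
    g[rising] = 0.5 * (1.0 - np.cos(np.pi * n[rising] / n1))
    g[falling] = np.cos(np.pi * (n[falling] - n1) / (2.0 * n2))
    return g

def fit_pulse(g_a, n1_values, n2_values):
    """Return (E, n1, n2) minimising the squared error of Eq 3.17."""
    best = None
    for n1 in n1_values:
        for n2 in n2_values:
            if n1 + n2 > len(g_a):
                continue  # pulse would not fit in one cycle
            error = np.sum((model_pulse(n1, n2, len(g_a)) - g_a) ** 2)
            if best is None or error < best[0]:
                best = (error, n1, n2)
    return best

# Example: a 134 point cycle with a small high frequency ripple on top.
cycle = np.arange(134)
g_rough = model_pulse(50, 20, 134) + 0.01 * np.sin(0.8 * np.pi * cycle)
error, n1_fit, n2_fit = fit_pulse(g_rough, range(30, 70), range(10, 40))
smooth_pulse = model_pulse(n1_fit, n2_fit, 134)
```

The fitted pulse `smooth_pulse` plays the role of the synthetic $g_i[n]$: it keeps the overall shape and drops the superimposed ripple.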

This was the first step of our iteration method to find the individual $V_i(z)$. This $V_i(z)$ then replaced the initial $V_a(z)$ and was again utilized to find another glottal wave transfer function, and so on. This process of iteration was repeated several times, till a very near approximation to the average area glottal wave transfer function was achieved.
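The loop described in this section can be summarised in a short sketch. It assumes NumPy and that $H(z)$ and $V(z)$ are held as equal length arrays of DFT samples; the pulse fitting step is passed in as a callable (a hypothetical `fit_glottal_pulse`), since its details depend on the glottal model used. The stopping rule here is a fixed iteration count, which is an assumption.

```python
import numpy as np

def successive_iteration(H, V_initial, fit_glottal_pulse, n_iterations=5):
    """Alternate between glottal and vocal tract estimates.

    H, V_initial      : DFT samples of the speech and initial tract responses
    fit_glottal_pulse : callable mapping a rough g[n] to a smooth model pulse
    """
    V = np.asarray(V_initial, dtype=complex)
    g_model = None
    for _ in range(n_iterations):
        G_rough = H / V                       # divide out the current tract estimate
        g_rough = np.fft.ifft(G_rough).real   # rough glottal wave
        g_model = fit_glottal_pulse(g_rough)  # smooth it with the pulse model
        G_model = np.fft.fft(g_model, len(H))
        V = H / G_model                       # updated vocal tract estimate
    return V, g_model

# Toy check with made-up sequences (not thesis data).
g_true = np.array([1.0, 0.6, 0.2])
v_true = 0.8 ** np.arange(64)
H_toy = np.fft.fft(g_true, 64) * np.fft.fft(v_true)
V_toy, _ = successive_iteration(H_toy, np.ones(64), lambda g: g_true, 2)
```

In the toy check the fitting step always returns the true pulse, so the loop recovers the true tract response whatever the starting estimate; with a real fitting routine the result depends on how well the pulse model matches the speaker.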

The $V_i(z)$ which we finally got was what we desired for calculation of the individual vocal tract area functions. This $V_i(z)$ is shown in Fig 3.2 as the frequency response of the 10 section individual vocal tube. In chapter 4 we will deal with the extraction of the individual area functions from this $V_i(z)$.



FIG 3.2 FREQUENCY RESPONSE OF INDIVIDUAL VOCAL TRACT AS COMPUTED FOR VOWEL /aw/ (vertical axis: 20 log10(exp(jw)))


CHAPTER 4 




VOCAL TRACT PARAMETERS

After having obtained the transfer function for the individual vocal tract, $V_i(z)$, the thesis addressed itself to extracting the individual vocal tract parameters from it. This phase required careful examination of the available techniques and of their suitability, accuracy, speed of computation and reliability in our context. It was but natural that the most powerful speech analysis technique was chosen. Linear predictive coding of the speech signal was one such technique which met all our requirements. A brief introduction to this technique is given below.

4.1 LINEAR PREDICTIVE CODING OF SPEECH SIGNAL

4.1.1 Introduction 

The method of linear predictive coding of the speech signal is one of the most powerful techniques for the analysis of speech signals. It is the predominant technique for estimating the basic parameters of speech, like vocal tract area functions. The basic idea behind linear predictive analysis is that a speech sample can be approximated as a linear combination of past speech samples. By minimizing the sum of squared differences (over a finite interval) between the actual speech samples and the linearly predicted ones, a unique set of predictor coefficients can be determined [1].

Since the speech signal can be modelled as the output of a linear, time varying system excited by quasi periodic pulses, as in the case of our sustained vowel, linear prediction provides a robust, reliable and accurate method for estimating the speech parameters that characterise a linear time varying system. Without going into further details, we will apply this technique for calculating the individual vocal tract area functions.

4.1.2 The technique as actually applied

To get the vocal tract parameters from the $V_i(z)$ obtained in chapter 3, a set of linear predictor coefficients $\alpha_k$ had to be found first. This $V_i(z)$ can be represented as

$$ V_i(z) = \frac{1}{A_i(z)} $$ ..4.1

where $A_i(z)$ is a polynomial of the form

$$ A_i(z) = 1 - \sum_{k=1}^{p} \alpha_k z^{-k} $$ ..4.2

obtained from the linear predictor coefficients, and which could also be obtained by the recursion

$$ A^{(0)}(z) = 1 $$ ..4.3

$$ A^{(i)}(z) = A^{(i-1)}(z) - k_i z^{-i} A^{(i-1)}(z^{-1}) $$ ..4.4

$$ A(z) = A^{(p)}(z) $$ ..4.5

where $k_i$ is the PARCOR coefficient. One can draw an analogy between these equations (4.3, 4.4, 4.5) and Eqs 3.14, 3.15, 3.16. This will be more evident in section 4.2. Eqs 1.4 and 4.1 are likewise closely related to each other.

4.1.3 Linear Predictor Coefficients 

Although the set of predictor coefficients $\alpha_k$, $1 \leq k \leq p$, is often thought of as the basic parameter set of linear predictive analysis, it is straightforward to transform this set of coefficients into a number of other parameter sets, like PARCORs, to obtain vocal tract area functions [9,10]. This is how we proceed now.

The reciprocal of each point in $V_i(z)$ gave $A_i(z)$ (Eq 4.1). The IDFT of this $A_i(z)$, computed through Pascal programming, determined its impulse response as

$$ a[n] = \delta[n] - \sum_{k=1}^{p} \alpha_k \delta[n-k] $$ ..4.6

where $\delta[n]$ is a unit impulse and $\alpha_k$ is a linear predictor coefficient. The method of calculation of the predictor coefficients is briefly explained as follows:

Suppose there are 10 samples in $a[n]$. Then for $n=0$,

$$ a[0] = \delta[0] - [\alpha_1 \delta[0-1] + \alpha_2 \delta[0-2] + \ldots + \alpha_p \delta[0-p]] = 1 $$ ..4.7

for $n=1$,

$$ a[1] = \delta[1] - [\alpha_1 \delta[0] + \alpha_2 \delta[1-2] + \ldots + \alpha_p \delta[1-p]] = -\alpha_1 $$ ..4.8

and similarly, for $n=p$,

$$ a[p] = -\alpha_p $$ ..4.9



The first value of the sequence will be unity, as illustrated above. The subsequent predictor coefficients came out to be the negative of the corresponding values of $a[n]$. (Further details may be seen in [1].)

The first 11 values were taken from $a[n]$, since we wanted a set of 10 predictor coefficients for our 10 section vocal tube. To make the first value unity as per our above observations, and to determine the subsequent values on the basis of this manipulation, we divided all the 11 values taken by the first value. In this way a set of 10 predictor coefficients was obtained from the 2nd to the 11th values. The next requirement was to obtain the corresponding set of 10 PARCOR coefficients from these predictor coefficients.
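The steps of this section translate into a few lines. The sketch assumes NumPy and that $V_i(z)$ is available as DFT samples; the function name and the test coefficients are illustrative and are not thesis data.

```python
import numpy as np

def predictor_coeffs_from_response(V, order=10):
    """Recover alpha_1 .. alpha_p from samples of V(z) = gain / A(z).

    1. A(z) is the point-by-point reciprocal of V(z).
    2. a[n] is the IDFT of A(z).
    3. The first order+1 values are divided by a[0] so that a[0] = 1.
    4. alpha_k = -a[k] for k = 1 .. order (Eqs 4.7 to 4.9).
    """
    a = np.fft.ifft(1.0 / np.asarray(V)).real
    a = a[: order + 1] / a[0]
    return -a[1:]

# Round trip on a made-up second order example with an arbitrary gain of 3.
alpha_true = np.array([1.2, -0.5])
A_samples = np.fft.fft(np.concatenate(([1.0], -alpha_true)), 512)
alpha_back = predictor_coeffs_from_response(3.0 / A_samples, order=2)
```

The division by the first value removes any overall gain in $V(z)$, which is why the round trip above recovers the coefficients even though the response was scaled.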

4.1.4 PARCOR Coefficients 

Obtaining the PARCOR coefficients was our next step towards the final results. PARCOR coefficients give us the ratio between the areas of adjacent sections. A set of PARCORs could be obtained from the LPC coefficients using a backward recursion of the form (see ref [8])

$$ k_i = a_i^{(i)} $$ ..4.10

$$ a_j^{(i-1)} = \frac{a_j^{(i)} + a_i^{(i)} a_{i-j}^{(i)}}{1 - k_i^2}, \quad 1 \leq j \leq i-1 $$ ..4.11

where $i$ goes from $p$ to $p-1$, down to 1, and initially we set

$$ a_j^{(p)} = \alpha_j, \quad 1 \leq j \leq p $$ ..4.12

In our case $p=10$. In that way a set of 10 PARCORs was calculated after setting the initial condition as in Eq. 4.12, i.e., the 11th value gave the 10th PARCOR coefficient.
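The backward recursion of Eqs 4.10 to 4.12 can be sketched as below, assuming NumPy. A forward (step-up) recursion derived from Eq 4.4 is included only to check the result by a round trip; the function names and test values are illustrative.

```python
import numpy as np

def parcor_from_predictor(alpha):
    """Backward recursion: predictor coefficients -> PARCOR coefficients."""
    a = np.asarray(alpha, dtype=float).copy()   # a[j-1] holds a_j^(p)
    p = len(a)
    k = np.zeros(p)
    for i in range(p, 0, -1):
        k_i = a[i - 1]                          # Eq 4.10: k_i = a_i^(i)
        k[i - 1] = k_i
        if i > 1:
            lower = a[: i - 1]                  # a_1^(i) .. a_{i-1}^(i)
            # Eq 4.11; the reversed slice supplies a_{i-j}^(i).
            a = (lower + k_i * lower[::-1]) / (1.0 - k_i ** 2)
    return k

def predictor_from_parcor(k):
    """Forward recursion from Eq 4.4: a_j^(i) = a_j^(i-1) - k_i a_{i-j}^(i-1)."""
    a = np.array([])
    for k_i in k:
        a = np.append(a - k_i * a[::-1], k_i)
    return a

k_example = np.array([0.5, -0.3, 0.2])
alpha_example = predictor_from_parcor(k_example)
k_back = parcor_from_predictor(alpha_example)
```

The recursion divides by $1 - k_i^2$, so it assumes $|k_i| < 1$ at every stage, which holds for a stable all-pole model.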

4.2 THE FINAL RESULT 

With the PARCORs obtained, the thesis was poised for final realization of its objective.

Comparing Eqs 3.14 to 3.16 with Eqs 4.3 to 4.5, if for a $p$ section tube $r_i = -k_i$, then it is clear that

$$ D(z) = A(z) $$

Using the relation $r_k = \frac{A_{k+1} - A_k}{A_{k+1} + A_k}$, it can be shown that the area of the $i$th section of the vocal tract can be found from

$$ A_{i+1} = \frac{1 - k_i}{1 + k_i} A_i $$ ..4.13

where, for the last section, $A_{i+1}$ is the value of the area taken from Fig 1.6(a) at the 11th section of the vocal tube [8].

From this relation (4.13), we finally achieved what we had set our sights on. All the 10 values of the individual vocal tract areas and their reflection coefficients were calculated. Fig 4.1 shows the individual area functions and their corresponding reflection coefficients.
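Relation 4.13 can be applied section by section, starting from the known area beyond the last section and working back towards the glottis. A minimal sketch, with illustrative PARCOR values and termination area rather than thesis results:

```python
def areas_from_parcor(k, end_area):
    """Areas A_1 .. A_{p+1} from PARCORs k_1 .. k_p and the known A_{p+1}.

    Eq 4.13 solved for A_i:  A_i = A_{i+1} * (1 + k_i) / (1 - k_i).
    """
    p = len(k)
    areas = [0.0] * (p + 1)
    areas[p] = end_area
    for i in range(p, 0, -1):
        areas[i - 1] = areas[i] * (1.0 + k[i - 1]) / (1.0 - k[i - 1])
    return areas

k_values = [0.2, -0.1]                 # hypothetical PARCORs
tube_areas = areas_from_parcor(k_values, end_area=2.0)
# Reflection coefficients implied by the recovered areas.
r_values = [
    (tube_areas[i + 1] - tube_areas[i]) / (tube_areas[i + 1] + tube_areas[i])
    for i in range(len(k_values))
]
```

The reflection coefficients recomputed from the areas satisfy $r_i = -k_i$, which is the condition under which $D(z) = A(z)$.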




CHAPTER 5


LOSSES IN VOCAL TRACT

5.1 EFFECTS OF LOSSES

As had been described earlier, in chapter 1, speech coming out of the lips suffers some energy losses in the vocal tract. These losses are
(a) wall vibration loss,
(b) viscous friction loss, and
(c) heat conduction loss.

We shall consider each loss in some detail before finally considering the total loss caused by them in the tube. To consider the effects of these losses we return to the basic laws of motion and of acoustic propagation of waves. The effects have been analysed considering the vocal tube as an acoustic model.

5.1.1 Losses due to wall vibration 

Because of its mass, the air in the tube exhibits an inertance which opposes acceleration. Because of its compressibility, the volume of air exhibits a compliance. The variation of air pressure inside the tract will cause the walls to experience a varying force. Thus, if the walls are elastic, the cross sectional area will change depending upon the pressure in the tube. Assuming the walls to be locally reacting [14,15], the area $A(x,t)$ will be a function of $p(x,t)$, the pressure. Since the pressure variation is small, the resulting variation can be treated as a small perturbation of the nominal area, i.e., we can assume that

$$ A(x,t) = A_0(x,t) + \delta A(x,t) $$ ..5.1

where $A_0(x,t)$ is the nominal area and $\delta A(x,t)$ is a small perturbation. This is depicted in Fig. 5.1. Because of the mass and elasticity of the vocal tract wall [20], the relationship between area perturbation and pressure variation can be modelled by a differential equation. The details are beyond the scope of this thesis. Suffice it to say that the two are related by differential equations.

To consider the effect of wall vibration in the frequency domain representation of a time invariant tube excited by a complex volume velocity source, Flanagan [4] gave the differential equations

$$ -\frac{\partial P}{\partial x} = Z U $$ ..5.2

$$ -\frac{\partial U}{\partial x} = Y P + Y_w P $$ ..5.3

where $Y_w$ is the wall admittance, $Y$ is the acoustic admittance per unit length of the vocal tube, $Z$ is the acoustic impedance per unit length of the vocal tube, and $U$ and $P$ are the frequency domain representations of volume velocity and pressure respectively. These parameters can be calculated from the equations

$$ Z(x, \Omega) = j\Omega \frac{\rho}{A_0(x)} $$ ..5.4

$$ Y(x, \Omega) = j\Omega \frac{A_0(x)}{\rho c^2} $$ ..5.5

and

$$ Y_w(x, \Omega) = \frac{1}{j\Omega\, m_w(x) + b_w(x) + \dfrac{k_w(x)}{j\Omega}} $$ ..5.6

where $m_w(x)$, $b_w(x)$ and $k_w(x)$ are the mass/unit length, damping/unit length and stiffness/unit length of the vocal tract wall respectively, and

$\rho$ is the density of air in the tube,
$c$ is the velocity of sound in the air, and
$\Omega$ is the radian frequency.
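Equations 5.4 to 5.6 can be evaluated numerically to see how the wall admittance behaves with frequency. The sketch below assumes NumPy and CGS units; the air constants and the wall parameters (`m_w`, `b_w`, `k_w`) are illustrative assumptions and not values measured or used in this thesis.

```python
import numpy as np

RHO = 1.14e-3   # assumed density of air, g/cm^3
C = 3.5e4       # assumed velocity of sound, cm/s

def tube_immittances(area_cm2, freq_hz, m_w=0.4, b_w=6500.0, k_w=0.0):
    """Return (Z, Y, Y_w) of Eqs 5.4 to 5.6 at one frequency (freq_hz > 0)."""
    omega = 2 * np.pi * freq_hz
    Z = 1j * omega * RHO / area_cm2
    Y = 1j * omega * area_cm2 / (RHO * C ** 2)
    Y_w = 1.0 / (1j * omega * m_w + b_w + k_w / (1j * omega))
    return Z, Y, Y_w

Z_low, Y_low, Yw_low = tube_immittances(5.0, 100.0)
Z_high, Y_high, Yw_high = tube_immittances(5.0, 3000.0)
wall_to_air_low = abs(Yw_low) / abs(Y_low)
wall_to_air_high = abs(Yw_high) / abs(Y_high)
```

With these assumed values the ratio of wall admittance to acoustic admittance is far larger at 100 Hz than at 3000 Hz, so the wall term matters mainly at low frequencies.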

Here it is to be noted that when $A_0(x)$ is constant, the expressions of Eqs 5.4 and 5.5 reduce to

$$ Z = j\Omega \frac{\rho}{A} $$ ..5.7

$$ Y = j\Omega \frac{A}{\rho c^2} $$ ..5.8

which are the expressions for the lossless vocal tube [7]. The effect of wall vibration is negligible at high frequencies because of the very little motion of the massive walls of the vocal tube [2]. But it is most pronounced at low frequencies. Hence any vocal tube model must consider this loss.

5.1.2 Viscous friction loss

Viscous friction losses are proportional to the square of particle velocity. These losses are due to viscous friction between the air flow and the walls of the vocal tube. The effect of viscous friction losses can be accounted for in the frequency domain representation (Eqs 5.2 and 5.3) by including a real, frequency dependent term in the expression for the acoustic impedance $Z$ [2], i.e.,

$$ Z(x, \Omega) = \frac{S(x)}{[A_0(x)]^2} \sqrt{\frac{\Omega \rho \mu}{2}} + j\Omega \frac{\rho}{A_0(x)} $$ ..5.9

where $S(x)$ is the circumference of the tube and
$\mu$ is the coefficient of friction.

5.1.3 Heat conduction loss 

Heat conduction losses in a smooth, hard walled tube are proportional to the square of the sound pressure [2]. As in the case of viscous friction losses, the effects of heat conduction through the vocal tract wall can likewise be accounted for by adding a frequency dependent term to the acoustic admittance $Y(x,\Omega)$, i.e.,

$$ Y(x, \Omega) = \frac{S(x)(\eta - 1)}{\rho c^2} \sqrt{\frac{\lambda \Omega}{2 c_p \rho}} + j\Omega \frac{A_0(x)}{\rho c^2} $$ ..5.10

where $c_p$ is the specific heat at constant pressure,
$\eta$ is the ratio of specific heat at constant pressure to that at constant volume, and
$\lambda$ is the coefficient of heat conduction.

Both the heat conduction loss and the viscous friction loss expressions are applicable only to a smooth, rigid vocal tube. The vocal tube is neither [15, 20]. To reiterate, therefore, as we had said earlier in section 1.2.4, the effects of viscous friction and thermal conduction at the walls are much less pronounced than the effects of wall vibration at low frequencies.

Having elaborated the energy losses in the vocal tube, it is possible to approximate these losses, or transmission characteristics as we call them in electric circuit terminology, by a π network section as shown in Fig 5.1. We shall see in the next section how all these losses can be associated with our work.

5.2 APPROXIMATION TO LOSSLESS VOCAL TUBE

The total loss in Fig 5.1 can be replaced with a single complex source $Z$. From this $Z$, the reflection coefficients for each section of the average vocal tube can be calculated from the expression

$$ r_i = \frac{Z_{i+1} - Z_i}{Z_{i+1} + Z_i} $$ ..5.11

and from these reflection coefficients the average area transfer function $V_a(z)$ of Eq. 1.4 could be calculated, and the whole process of successive iteration as described in the preceding chapters can be applied. The only difference is that the reflection coefficients will be complex quantities rather than real as obtained in Eq 1.5.
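Expression 5.11 carries over unchanged to complex values. A small sketch assuming NumPy, with made-up section impedances, shows that lossy (complex) impedances give complex reflection coefficients while purely real ones give real coefficients.

```python
import numpy as np

def reflection_from_impedance(Z):
    """r_i = (Z_{i+1} - Z_i) / (Z_{i+1} + Z_i) for a list of section impedances."""
    Z = np.asarray(Z, dtype=complex)
    return (Z[1:] - Z[:-1]) / (Z[1:] + Z[:-1])

# Hypothetical impedances of three adjacent sections (arbitrary units).
Z_lossy = [1.0 + 0.1j, 3.0 + 0.2j, 3.0 + 0.2j]
r_lossy = reflection_from_impedance(Z_lossy)
r_lossless = reflection_from_impedance([1.0, 3.0, 3.0])
```

Two identical adjacent sections give a zero reflection coefficient in both cases.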

However, one may be tempted to believe that if these losses are to be accounted for, the transfer function $V_a(z)$ of the average area vocal tract will be considerably different from what we had calculated earlier. This apprehension is not true, because the yielding walls tend to raise the resonant frequencies while viscous and thermal losses tend to lower them. The net effect is a slight, very slight, upward shift of the lower resonant frequencies as compared to the lossless, rigid walled tube model. That is why the model which had been chosen by us to analyse our speech is a good representation of sound transmission in the vocal tract.







CHAPTER 6 


CONCLUSION

The thesis has focussed its attention upon three main areas, namely, the human mechanism and physics of speech production for voiced sounds, the separation of the glottal wave and vocal tract transfer functions from their superimposed frequency spectrum, and lastly, the extraction of the area functions of the individual vocal tract. A discrete time model of speech was first made after the explanation of the complete process of speech production, including the losses. This model formed the bedrock of our thesis work.

The components of the speech spectrum, G(z) and V(z), were separated from each other through successive iteration. The results were shown in the form of figures which have been attached with each phase of the work done.

Finally, the ultimate realisation of the area function of the individual vocal tract was achieved with significant success. The thesis work was tried for several sustained vowel utterances on some cases who had difficulty in uttering these vowels. The determination of the area functions of their vocal tracts helped in diagnosing the disorders in their vocal tracts.

SCOPE FOR FUTURE WORK 

The thesis work has great scope for future work in the diagnosis of disorders in any individual's vocal tract and in speaker identification and secrecy. Presently the diagnosis of faulty speech utterances is going on with great vigour and speed. In fact this is one of the latest fields in biomedical engineering, where engineering skill is being amalgamated with medical science, opening new vistas for people severely handicapped by natural disorders.

Another very important application of the thesis in future work is speaker identification (the individual area function of the vocal tract is "personal") and secrecy. An enormous amount of work is going on, particularly in defence research organizations, on this subject. This application for defence forces needs no elaboration. The line of approach adopted is as described in the thesis.





REFERENCES 


[1] Lawrence R. Rabiner and Ronald W. Schafer, "Digital Processing of Speech Signals", Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

[2] James L. Flanagan, "Speech Synthesis and Perception", Springer-Verlag, 2nd edition, New York, 1972.

[3] A.V. Oppenheim and R.W. Schafer, "Digital Signal Processing", Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1975.

[4] G. Fant, "Acoustic Theory of Speech Production", Mouton, The Hague, 1970.

[5] Chiba and M. Kajiyama, "The Vowel its Nature and Source", Phonetic Society of Japan, 1958.

[6] M.M. Sondhi, "Model for Wave Propagation in Lossy Vocal Tract", J. Acoust. Soc. Am., Vol. 55, No. 5, pp 1070-1075, May 1974.

[7] M.M. Sondhi and B. Gopinath, "Determination of Vocal Tract Shape from Impulse Response at Lips", J. Acoust. Soc. Am., Vol 49, No. 6 (Part 2), pp 1867-73, June 1971.

[8] H. Wakita, "Direct Estimation of Vocal Tract Shape by Inverse Filtering of Acoustic Speech Waveforms", IEEE Trans. on Audio and Electroacoustics, Vol AU-21, pp 417-427, Oct 1973.

[9] J.D. Markel and A.H. Gray, Jr, "Linear Prediction of Speech", Springer-Verlag, New York, 1976.

[10] J. Makhoul, "Linear Prediction: A Tutorial Review", Proc. IEEE, Vol. 63, pp 561-589, 1975.

[11] B.S. Atal and S.L. Hanauer, "Speech Analysis and Synthesis by Linear Prediction of Speech Wave", J. Acoust. Soc. Am., Vol 50, pp 637-655, Aug. 1971.

[12] Wolfgang Hess, "Pitch Determination of Speech Signals", Springer-Verlag, New York, 1983.

[13] Werner Verhelst and Oscar Steenhaut, "New Model for Short-Time Complex Cepstrum of Voiced Speech", IEEE Trans. on Acoust., Speech and Signal Processing, Vol ASSP-34, No. 1, Feb. 1986.

[14] M.R. Portnoff, "A Quasi One Dimensional Digital Simulation for the Time Varying Vocal Tract", M.S. Thesis, Dept. of Elect. Engr., MIT, Cambridge, Mass., June 1973.

[15] P.M. Morse and K.U. Ingard, "Theoretical Acoustics", McGraw-Hill Book Co., New York, 1968.

[16] K. Ishizaka and J.L. Flanagan, "Synthesis of Voiced Sounds from a Two Mass Model of Vocal Tract", Bell Syst. Tech. J., Vol. 50, No. 6, pp 1233-1268, July-August 1972.

[17] B.S. Atal, "Determination of Vocal Tract Shape Directly from Speech Wave", J. Acoust. Soc. Am., Vol 74, p 64, Jan 1970.

[18] Man Mohan Sondhi and J.R. Resnick, "Inverse Problem for Vocal Tract: Numerical Methods, Acoustical Experiments, and Speech Synthesis", J. Acoust. Soc. Am., Vol 73, No. 3, pp 985-1002, Mar. 1983.

[19] Takiya Koizumi, Shuji Taniguchi and Seijiro Hiromitsu, "Glottal Source Vocal Tract Interaction", J. Acoust. Soc. Am., Vol 78, No. 5, pp 1541-47, Nov. 1985.

[20] Paul Milenkovic, "Acoustic Tube Reconstruction from Non-Causal Excitation", IEEE Trans. on Acoust., Speech and Signal Processing, Aug. 1987, p 1089.

[21] M. Rothenberg, "Acoustic Interaction between Glottal Source and Vocal Tract", in Vocal Physiology, Tokyo UP, Tokyo, Japan, pp 305-328.

[22] A.V. Oppenheim and R.W. Schafer, "Homomorphic Analysis of Speech", IEEE Trans. on Audio and Electroacoust., Vol AU-16, pp 221-226, June 1968.

[23] J.M. Tribolet, T.F. Quatieri and A.V. Oppenheim, "Short Time Homomorphic Analysis", IEEE Int. Conf. Acoust., ASSP, 1977, pp 716-722.

[24] A.M. Noll, "Cepstrum Pitch Determination", J. Acoust. Soc. Am., Vol 41, pp 293-309, Feb. 1967.

[25] A.V. Oppenheim, "Discrete Time Signal Processing", Prentice-Hall, Inc., Englewood Cliffs, New Jersey.

[26] J.G. Graeme, G.E. Tobey, L.P. Huelsman, "Operational Amplifiers", McGraw-Hill, 1971.

[27] A.E. Rosenberg, "Effect of Glottal Pulse Shape on the Quality of Natural Vowels", J. Acoust. Soc. Am., Vol 49, No. 2, pp 583-590, Feb. 1971.



